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Abstract 

We propose a model for functional data registration that extends current inferential ca¬ 
pabilities for unregistered data by providing a flexible probabilistic framework that 1) allows 
for functional prediction in the context of registration and 2) can be adapted to include 
smoothing and registration in one model. The proposed inferential framework is a Bayesian 
hierarchical model where the registered functions are modeled as Gaussian processes. To 
address the computational demands of inference in high-dimensional Bayesian models, we 
propose an adapted form of the variational Bayes algorithm for approximate inference that 
performs similarly to MCMC sampling methods for well-defined problems. The efficiency of 
the adapted variational Bayes (AVB) algorithm allows variability in a predicted registered, 
warping, and unregistered function to be depicted separately via bootstrapping. Tempera¬ 
ture data related to the El-Nino phenomenon is used to demonstrate the unique inferential 
capabilities for prediction provided by this model. 

Keywords: Bayesian Modeling, functional data, functional prediction, registration, smoothing, 
variational Bayes 


1 INTRODUCTION 

This paper introduces a novel approach to functional data registration within a Bayesian hier¬ 
archical model. In this model, both registered and warping functions are modeled in terms of 
Gaussian processes with registration described by restrictions on the covariance of the registered 
functions. This enables both MCMC and variational Bayes methods to be employed for estimation 
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and prediction. Our model for registration extends current inferential procedures to include both 
1) functional prediction in the context of registration and 2) registration and smoothing in one 
model. 

The primary advantage to our proposed registration model in comparison to current regis¬ 
tration methods is that it provides a probabilistic framework in which new observations can be 
considered. Assuming a new unregistered function has been partially recorded, with this frame¬ 
work, we can obtain estimates of the registered partial function and also the corresponding partial 
warping function. Using these estimates, the complete registered function, the complete warping 
function, and the complete unregistered function can be predicted. Details of the prediction model 
can be found in Section 5. To the authors’ knowledge this is the first time functional prediction 
is considered in the context of registration. 

Additionally, our model can be extended to allow for noisy observations. Most current regis¬ 
tration methods consider functional regularization as a pre-processing step with the exception of 
Raket et. al. [2014]. However, in the paper by Raket et. al. [2014], the authors’ maximum likeli¬ 
hood approach to registering latent functions does not provide an easy way to quantify variability 
in the estimates of the registered functions. In Appendix C.3, we provide an illustration of how 
smoothing functions in a pre-processing step can significantly underestimate the variability in the 
estimates of the registered functions. 

The simplest definition of functional data registration is any algorithm that aligns functions 
in a way that eliminates all phase variability between functions ( Ramsay and Silverman [2005]). 
Without registration, basic summary statistics such as the sample mean and covariance are less 
interpretable as time variation between significant features in the functions tends to dampen the 
amplitude variation in these features. Furthermore, the average timing of significant features may 
also be of interest and is difficult to obtain under traditional methods of analyzing functional data. 
There has been much recent interest in proper ways to define and measure registration as well as 
in developing registration methods with desirable statistical properties. 

The evolution of registration dates back to Sakoe and Chiba [1978] where the authors use 
a dynamic programming algorithm for landmark registration. Landmark registration was again 
considered by Kneip and Gasser [1992] and Kneip and Gasser [1995]. In 1997, a new cost function 
for functional registration was introduced by Wang and Gasser [1997]. A significant advancement 
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in registration literature can be traced to Silverman [1995] and Ramsay and Li [1998] where the 
authors introduce global registration procedures, and Ramsey considers the use of a flexible family 
of monotone warping functions. Parametric and B-spline base warping functions are considered 
by Brumback and Lindstrom [2004] and Gervini and Gasser [2004], respectively. Nonparametric 
maximum likelihood approaches to registration are considered by both Ronn [2001] and Gervini 
and Gasser [2005]. A moments based approach to registration is introduced by James [2007]. 
Tang and Muller [2008] propose pairwise curve synchronization. The first Bayesian approach to 
registration can be found in Telesca and Inoue [2007]. Registration to principal components is 
considered by Kneip and Ramsay [2008]. Finally, with regard to improvements in registration, the 
recent work by Srivistava, et.al. [2011] offers the most comprehensive framework for registration 
to date. 

Much of the focus in combining registration with other types of inference in one model has 
been in the area of functional data clustering and registration. Current work in this area can be 
found in Liu and Yang [2009], Sangalli et. al. [2010], and also a Bayesian approach in Zhang and 
Telescar [2014], Recent work by Raket et. al. [2014] includes a model for functional smoothing and 
registration. While these extensions to registration procedures offer additional tools for functional 
data analysis, they tend to focus less on high-quality registration. 

In this paper, we develop Bayesian hierarchical models that address both areas of development 
in registration procedures. First, a model is proposed to register functional data that gives es¬ 
timates that compare favorably with those from the best current registration methods available, 
notably, Srivistava, et.al. [2011]. Then, we demonstrate how this model can be extended to incor¬ 
porate other inferential procedures. The two examples provided in this paper are extensions for 
both a functional prediction model and a model for simultaneous registration and smoothing. 

This paper also addresses the computational issues associated with high-dimensional Bayesian 
hierarchical models. To this end, we propose an alternative algorithm to variational Bayes ap¬ 
proximation that can be used for models in which the full conditional distributions of a subset of 
the parameters are not from a known parametric family. To distinguish our algorithm from pure 
variational Bayes, in this paper we will refer to this approximate inference procedure as Adapted 
Variational Bayes (AVB). 

This paper is organized as follows. Section 2 presents our basic registration model. The 
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Adapted Variational Bayes algorithm is discussed in detail in Section 3. A comparison of results 
from our model to current methods can be found in Section 4.1. Additionally, a comparison 
of results obtained using AVB and those given by MCMC can be found in 4.2. The prediction 
model is presented in Section 5. In Section 5.4, the prediction model is used to forecast the future 
trajectory of sea-surface temperatures that are associated with the El-Nino phenomenon. An 
adaptation to our model that allows for noisy data is found in Appendix C. Finally, a discussion 
is found in Section 6. 


2 GAUSSIAN PROCESS MODELS FOR REGISTRA¬ 
TION 

The functional registration models proposed in this paper are foremost designed to extend and im¬ 
prove on the minimum eigenvalue registration criterion for continuous registration first introduced 
by Ramsay and Li [1998] . Accordingly, we will consider two functions perfectly registered if the 
variation between the two functions can be described entirely in terms of one functional direction - 
the target function. Our method of registration improves on Ramsay and Li’s Procrustes method, 
Ramsay and Li [1998], by implicitly accounting for vertical shifts between registered functions and 
by allowing the target curve to evolve throughout the registration procedure. In Section 4, we will 
demonstrate how using the minimum eigenvalue criterion under these conditions provides a more 
complete curve registration. Our results are comparable to those of Srivistava, et.al. [2011]. 

The theoretical basis for modeling functional data as Gaussian processes in a hierarchical 
Bayesian environment is established in Earls and Hooker [2014], In our registration model, each 
registered function, Aj(h;(f)), i — 1,..., N, is the composition of an observed unregistered func¬ 
tion, Xj(t), with an unknown warping function, hi(t), over some fixed time domain T = [ti,t p }- 
The function hi(t) is represented as hi(t) — t\ + f* exp (wi(s))ds to enforce its monotonicity where 
Wi(t) is specified as having a Gaussian process distribution, resulting in a functional random effect. 
In this paper, we will refer to ■uy(f) as the base function associated with warping function h^t). 
The base functions are non-parametrically specified for optimal registration. We, however, impose 
the following restrictions on the warping functions: 

1. h(t i) = ti, 
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2 . h(t p ) = t p , and 

3. if tk > tj, then h(tk) > h(tj ) for all t 3 G T. 

Restrictions (1) and (3) are built into the definition of hi(t). Restriction (2) is imposed through 
the characteristic function in the expression for the prior defined for each base function, Wi(t). 
Note that Wi(t) = 0 corresponds to the identity warping, = t. An important feature of our 
definition of the warping functions is that it defines an identifiable relationship between h t (tj) and 
Wi(t) which is necessary for predicting future outcomes of curves that are only partially observed. 
In Section 5 is a more thorough discussion of the prediction model. 

We also model each registered function, X^h^t)), i = 1... N, as a Gaussian process such that 

Xi(hi(t)) | z 0i iZu,f(t) rsj GP(z 0i + ^ij/(t), 7 ^ 1 S(s,t)), s,teT. (1) 

Here, f(t) is the target function. The target function serves as the primary functional direction 
in which the registered functions vary. Accordingly, the above covariance function, r y^ 1 'E(s,t), is 
defined to penalize all variation in the registered functions beyond a scaling and vertical shifting of 
the target function (the function-specific mean). In these models we will define 7 # as a registration 
parameter that determines the severity of this penalty. 

Given a sample of unregistered functions, X t (t), i = 1...N, defined over the interval T = 
[ti,t p ], we are interested in estimating the warping functions, hi(t), the shifting and scaling pa¬ 
rameters, z 0i and z u , the target curve, f(t), and the registered functions. 

For now, we will assume the functions are recorded without noise. If the functions are recorded 
with noise, it is common practice in the current literature to first perform a pre-processing smooth¬ 
ing step. An undesirable result of this pre-processing step is that the subsequent inference pro¬ 
cedure is unable to capture the extra variability associated with the smoothing process. This 
phenomenon is illustrated using the Berkeley boys growth velocity data in Appendix C.3. Appen¬ 
dices C.l and C .2 detail how our basic registration model can be modified to both smooth and 
register functions. 

Inference is accomplished through a Bayesian hierarchical model. The distributional assump¬ 
tions and prior specifications for this model are 
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Xi(hi(t)) | Zoi,z u ,f(t) ~ GP(z 0 i + z li f(t),j Jt 1 E(s,t)), s,teT, i = 


S(s,t) 


hi(t) 


Wi(t) 


oc 


Pl(s,t ) + ^2(5,t), 

h+ exp(wi(s))ds, teT , i = l,...,iV, 
J ti 


GP(0,^ w 1 'E(s,t) + A^P^s, t))l{ti + 
s, t G T, i = 1,..., IV, 


exp(u/(s))d,s 


'ti 


Pj 
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1 ^2 
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2 
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2 
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~ GP(0,X f (s,t)), 

s,t e T, 

S/I 

{s,t ) 

= riJ 1 P 1 (s,t) + Xj 



Vf 

~ G(c,d), and 



A/ ~ G(c,d). 


N -1 

2/)7V ^ ^ Zgi, 

2=1 


( 2 ) 

(3) 


(4) 

(5) 


( 6 ) 

(7) 

( 8 ) 

(9) 


For this model, the parameters, z^, i — 1,... N, allow the registered functions to vary by ver¬ 
tical shifts from a scaling of the target function, f(t). The constraint, Zqn = — z oii ensures 

the average vertical shift is estimated to be 0. The parameters, Zu,i = 1,... N , quantify ampli¬ 
tude variation in the registered functions. Note, the Gaussian distribution on z x = (z\\... Z\n)' 
can be replaced by a Dirichlet distribution on Zi/N. The result is a slightly more complicated 
model that has the nice effect of scaling the target function to the empirical mean of the estimated 
registered functions. Priors (6) and (7) would then be omitted. 

In the above model specifications, all covariance functions are composed of a linear combina¬ 
tions of two bi-variate functions, P\(s,t) and P 2 (s,t). P\ (s, t) penalizes variation in constant and 
linear functions and P^s^t) penalizes function variability in all other directions. Together they 
define a proper covariance function. For each covariance function above, the specification of the 
registration and smoothing parameters indicate the extent the two different types of variability 
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should be penalized for each function. For example, for both the registered functions and the base 
functions, we want to penalize variation in any direction other than that of the mean function. The 
covariance specifications of 7 ^ 1 E(s, t) and 7 ~ 1 S(s,t) reflect these penalties, where the magnitude 
of the penalty is controlled by registration parameters, 7 # and 7 ™, (distributional assumptions 
2,3, and 4). We can use P 2 (s,t) to penalize roughness in a given function. Here we would like 
both the target function and the base functions to be smooth. This is achieved by the inclusion 
of XfP 2 (s,t) and X~ 1 P 2 (s,t ) in the priors for these functions (distributional assumptions 8 , 9, 4, 
and 5) where the level of the penalty is controlled by the smoothing parameters A f and A^. For 
the exact definitions of Pi(s,t) and P 2 (s,t), see Earls and Hooker [2014], 

Given the above model, in practice we will proceed by using finite approximations to each 
function and functional distribution. In Earls and Hooker [2014] we establish some theoretical 
properties of these types of approximations. The following finite-dimensional distributions are 
used in the final model in lieu of their infinite dimensional counterparts above: 


Xi(hi) | zoi,z u ,f ~ Np(z 0 il + 2 iif, 7 j t S), i = l...N, 


( 10 ) 


h ,(tj) = 4 + - 4-i) exp(wj(4_i)), i = 1... IV, j = l...p, 

k=2 

P 

W i oc iVp_i(0, 7 ~ 1 £ + A“ 1 P tu )l{fi + J^(4 - 4- 1 ) exp(tu i (4_ 1 )) = 4 }, 


k =2 


i = 1... IV, 

f | 77 /, Xf ~ N p (0,Hf), and 
Hif = 7]f 1 Pi + Xf 1 P 2 . 


( 11 ) 


Section 4 provides several examples that illustrate how allowing the target function to be 
estimated within the model results in a more complete functional registration in comparison 
to the Procrustes method, Ramsay and Li [1998]. However, the Gaussian process model does 
not constrain the timing of a feature in the target function to occur at the average time of the 
corresponding unregistered features. Although the model for the Wi(t) is centered on zero, it 
is still possible that the average of the estimated warped time points, h.(ti),. . does not 

correspond to the original time points. Shifting these by an additional registration so that the 
warped times average to the original time does not affect our prediction model, but it will then 
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allow an explicit comparison of h t {tj) to tj to tell us whether the process is running ahead or 
behind “standard” time. Srivistava, et.al. [2011] use a similar “correction” to determine their 
target function. Details on how to perform this final registration can be found in Appendix A.3. 

2.1 Registration and Warping Parameter Selection 

Registration in this model is controlled by three parameters, 7 r, 7 ^, and X w . The parameter 7 ^ 
determines the extent the registered functions will be penalized for varying from a shifting and 
scaling of the target function. This penalty for lack of registration is tempered by penalties for 
roughness in the warping functions. The parameter determines how far the warping functions 
can deviate from the identity warping while X w controls the smoothness of the warping functions. 
This model can also be adapted to allow for function specific warping penalties. In Section 5.4, 
we will give an example where function specific penalties for the base functions have been utilized 
to preserve significant covariance relationships in the estimated registered functions. 

For a given statistical analysis, the registration parameters are chosen by the user. This is 
because variation in registered functions outside of shifts and rescaling f(t ) (controlled by 7 r) 
is nearly non-identihable from variation in warping functions, as regulated by 7 ^ and A^. That 
is, there are linear directions of variation in the w{t ) that result in nearly linear directions of 
variation in the resulting Xi(t) (a simplified example is the identity sin(f + 8) — cos(<5) sin(t) + 
sin(5) cos(t) which confounds horizontal shifts with vertical variation in periodic functions). An 
exact description of this form of confounding is beyond the scope of this paper, but it yields 
unstable estimates when these parameters are not fixed and we therefore choose these by hand. 
In this model a large registration penalty, 7 #, in comparison to the penalty on warping, 7 ^, will 
result in registered functions that no longer retain significant features in the data. Alternatively, 
a registration parameter that is too small will not properly align features. Desirable values of 
these parameters can be determined using short runs of the adapted variational Bayes algorithm 
described in Section 3.1. I 11 practice, we have found these penalties should be adjusted by powers 
of ten to see a significant change in estimates of the registered functions. Once determined, 7 ^, 
7 w , and X w are fixed and can be used with the adapted variational Bayes estimates to initialize 
an MCMC sampler. 
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3 VARIATIONAL APPROXIMATION 


3.1 Variational Bayes 

For our registration model, it is appropriate to use Markov Chain Monte Carlo (MCMC) methods 
to sample from the joint posterior distribution of all unknown parameters. However, for most 
applications, the dimensionality of the parameter space will require exceptionally long chains that 
are impractical and expensive to obtain. Here, we suggest a variational Bayes alternative to MCMC 
sampling to at the very least obtain good starting values for a MCMC sampler. Alternatively, we 
will show in Section 4.2 that differences in the estimated parameters obtained through adapted 
variational Bayes and MCMC sampling tend to be small, and estimation via adapted variational 
Bayes alone is likely sufficient for many inferential procedures. 

The variational Bayes procedure described here is based on the variational methods proposed 
by Ornerod and Wand [2010] and Bishop [2006]. Their proposed method optimizes a lower bound 
of the marginal likelihood which results in finding an approximate joint posterior density that has 
the smallest Kullback-Leibler (KL) distance from the true joint posterior density. Both fixed form 
and nonparametric forms of variational Bayes algorithms are currently available. The variational 
Bayes algorithm that we propose is most closely related to fixed form variational Bayes. A clear 
explanation of fixed form variational Bayes can be found in Goldsmith et. al. [2011] where the 
authors utilize variational Bayes for a functional regression model. 

Suppose q(G) is the approximated posterior joint distribution. The fixed form variational 
Bayes algorithm assumes for some partition of 6 = { 0 1; ..., 0 d }, q{0) = qk{9k), where each 

distribution q^ is of a known parametric form. Traditionally this requirement is satisfied through 
the use of conditionally conjugate priors. 

In our model, the Gaussian process priors for the base functions, Wi(t), i = 1 ,,N, are not 
conditionally conjugate to the likelihood function. Therefore, the fixed form variational Bayes 
optimization method does not apply directly since ?fc(wj), i — 1,..., N are not known parametric 
distributions. 
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3.2 Adapted Variational Bayes (AVB) 

Suppose we order the parameter vector, 6. so that, 0 = {wi,..., w#, 6n+i, ■ ■ ■ ■, &d}, for k = 
{(N + 1),... ,d}. While the q distributions on the approximated base functions, w,, i — 1,..., N 
are not of known parametric forms, each qk(9k), k = {(N + 1 ),... ,d}, is a known parametric 
distribution that can be estimated using the standard fixed form variational Bayes algorithm. 
The following variational Bayes algorithm is adapted to include estimation for parameters without 
conditionally conjugate priors in addition to all other parameters that typically can be estimated 
using fixed form variational Bayes. This adaptation is similar to the variational approximation 
to the EM algorithm described by ? which performs approximate inference based on the EM 
algorithm for models where the posterior distribution of the latent variables is of an unknown 
form. 

The Adapted Variational Bayes Algorithm 

Define /(X, w, 0) as the joint distribution of the data, X, and parameters of a Bayesian hi¬ 
erarchical model. Suppose an approximate joint posterior distribution of the parameters, 6 = 
{wi,..., wjv,0jv+i, • • • , 04 , is of tlie f° rm nf=intw + i^( w j)^(^) where each qj(wj),j = 

1 ,...,N, is known only up to a constant of proportionality and the distributions, qk{9k),k = 

N + 1,... ,d, are of known parametric forms. We will define the following estimation procedure 
for 9 as the adapted variational Bayes algorithm. 

1. Initialize 6. 

2. For each iteration, m, and each k, k = 1,..., N, update the estimate for 

Wfc so that wj, m) = sup Wfc g fe (w fc | 0 i j "‘ l \j = (N + 1),... ,d). This is equivalent to setting 
w (m) = sup W /(X, w | j = (N + 1),..., d). 

3. For each iteration, m, and each k, k — (A+l),..., d, update qu so that q^ n> oc exp[E(e_ fc )(log f(0k 

rest)], where the expectation is taken with respect to the distributions q "' j = 

1 , • • • id, j k. 

4. Repeat steps (2) and (3) until the desired convergence criterion is met. 

Here the notation, E^g_ k ' ) , denotes the expected value over all parameters except 9^. In the next 
section, we will drop the subscript k, and E(g_ g j will represent the expectation over all parameters 
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except for 9 (e.g. i±7(0 ) will represent the expectation taken over all parameters except for rjf). 

Theorem The adapted variational Bayes algorithm converges to estimated parameters, G, that 
minimize the Kullback-Leibler distance between the approximate posterior distribution, q(G_ Vf ), 
and the posterior distribution, /(0_ w | X, w) ; for a local optimization o//(X, w | 0_ w ) in w. 

Proof: Assume w is known. Goldsmith et. al. [2011] demonstrate that minimizing the K-L 
distance between q(G_ w ) and /(0_ w | X, w) is equivalent to maximizing the following log of a 
q-specific lower bound of the joint marginal distribution of X and w in q. 

l°g(X, w; q) = I g(0_ w ) log | | dfl-w 

= ^g(O-w) [l0g[/( X , W, 0_ w )] - log[g(0_ w )]] 

The adapted variational Bayes algorithm alternates between: 1) maximizing /(X, w | 0_ w ) 
in w (possibly locally), and 2) fixing w at the value determined by the previous step and using 
traditional variational Bayes to maximize /(X, w;g). Here we demonstrate this process results 
in a monotonic increasing sequence in log/(X, w; g) which guarantees the convergence of this 
algorithm. 

For each iteration, m of our adapted variational Bayes algorithm, 

log/(X, = £,(s_ w )(log[/(X, eL”’)] - log[«(»S)]] 

< B»(«-)[iog[/(x.w("* +1 '.eL’5)]-iog[9(e«)]] (12) 

< B,( 8 -. ) (iog[/(x,w (m+i| .eL , y i) )) - io g [i,(<y” +1) )]] (13) 

= log/(X,w<’"+ 1 >; 9 <"'+ 1 '). 

The inequality in (12) is guaranteed by step 2 of the adapted variational Bayes algorithm, and 
the inequality in (13) is the result of using the traditional variational Bayes algorithm with w 
considered known (step 3). 

The lower bound of the marginal distribution of X and w can be monitored until changes 
in this function are under some threshold. The specific form of this function can be found in 
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Table 1: Computational Time (seconds) 


Method 

Sim 

BGV 

RBGV 

ME 

176 

1636 

NA 

F-R 

1.3 

2.9 

NA 

AVB 

1820 

36540 

28692 

MCMC 

222 

271296 

46037 


Appendix B.2. However, as the algorithm is guaranteed to converge, it is in practice more prudent 
to instead monitor changes in the parameter estimates from iteration to iteration and stop the 
algorithm when these changes are below a specified threshold. 

Convergence of the AVB algorithm is guaranteed. However, convergence to a global maximum 
is not guaranteed. In the maximization step of the AVB algorithm, a function proportional to the 
approximate posterior for the base functions is maximized in the base functions. This function 
can be multimodal and occasionally the estimated base functions reflect a local maximum of the 
approximated posterior. To circumvent this problem, in practice it is sometimes necessary to 
adjust the registration and warping penalties as the functions become registered. An unregistered 
function that requires a substantial amount of warping can cause convergence to a local maximum 
due to the small penalty on warping. The flexibility in warping allowed by this small penalty can 
cause the function to deform rather than register. This can be remedied in two ways. The first 
option is to perform a simple initial warping for this function that prevents the optimization from 
falling into a local mode. The second option is to adjust the registration and warping parameters 
over time. Initially a stronger warping penalty is employed to prevent function deformation. 
Then, as the functions register, the warping penalty can be reduced to allow for a more complete 
registration. When initializing an MCMC sampler, the final penalties on warping and registration 
from the adapted variational Bayes algorithm should be used. 

While AVB estimates are often close to MCMC estimates, often it will still be desirable to 
characterize the posterior distributions of all parameters through MCMC sampling. This is par¬ 
ticularly prudent when noise is present in the observed unregistered functions which introduces 
a significant amount of variability in the posterior samples (see Appendix C.3). We can express 
the computational savings due to using the AVB algorithm as the amount of time saved in burn- 
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in iterations. When applying the adapted variational Bayes algorithm to “real” data such as the 
Berkeley Boys Growth Velocity Data , we have found the adapted variational Bayes algorithm saves 
a significant amount of computational time in the burn-in period. For simulated data, there may 
not be any savings in computational time due to the following: 1) MCMC samples move very 
quickly towards the optimal solution when the registration problem is perfectly defined and 2) 
the maximization step of the AVB algorithm is inefficient, especially in the first few iterations. 
In Table 1, we compare the computational cost of obtaining estimates through MCMC sampling, 
AVB, the F-R algorithm, Srivistava, et.al. [2011], and the ME algorithm, Ramsay and Li [1998], 
for simulated data, the original Berkeley Boys Growth Velocity Data (BGV), and the boys growth 
velocity data corrupted with noise (RBGV). Full descriptions of these datasets and all proposed 
registration methods are described in Sections 4 and Appendix C.3. Since the ME, F-R and AVB 
algorithms do not include variability estimates, for this comparison the MCMC sampler is run 
long enough to obtain estimates but not necessarily long enough to characterize the posterior 
distributions. Note: since the ME and F-R algorithms treat smoothing as a pre-processing step, 
for the noisy growth velocity data we only compare the computational time needed to burn-in an 
MCMC sampler to the computational time required to obtain AVB estimates (that can then be 
used to initialize an MCMC sampler). 

4 COMPARISON TO CURRENT METHODS 

4.1 Comparison to Other Registration Procedures 

Our registration criterion minimizes all variation in the warped functions that is not in the direc¬ 
tion of the target function (allowing for vertical shifts). In this respect, the underlying registration 
principle driving our model is similar to that proposed by Ramsay and Li [1998]. Here we will 
compare our registration results to those using Ramsey’s method as well as the registration pro¬ 
cedure proposed by Srivistava, et.al. [2011]. Srivistava et. al. propose a geometric framework for 
functional data registration using the Fisher-Rao Riemannian metric, Rao [1945]. In this paper 
we will refer to Ramsey and Li’s registration procedure as ME (minimum eigenvalue, Ramsay 
and Li [1998]) and Srivistava’s procedure as F-R (Fisher-Rao, Srivistava, et.al. [2011]), and the 
model proposed here as GP (Gaussian Processes). In the paper by Srivistava, et. ah, several 
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Original Functions Minimum Eigenvalue 




Fisher - Rao 



Gaussian Processes 



Figure 1: Simulated Data Set 2. Top Left Unregistered functions. Top Right Registered func¬ 
tions using the minimum eigenvalue criteria (R package ’fda’). Lower Left Functions registered 
by F-R (R package ’fdasrvf’). Lower Right Functions registered by the GP model. 

comparisons of registration under the F-R framework to the registration methods proposed by 
Gervini and Gasser [2004], James [2007], Liu and Miiller [2004], and Tang and Muller [2008] are 
provided. In all cases, F-R appears to provide the most complete registration of the given set of 
functions. In light of this illustration, we will consider their method as the current frontrunner in 
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registration procedures and use it as the standard for our comparisons. 

Figures 1 and 2 contain the datasets used for this analysis. Each figure includes the original un¬ 
registered data along with plots of the functions registered using the three proposed methods. For 
all three registration methods, a range of parameter values were explored for optimal registration. 

We have chosen to use the Sobolev Least Squares (sis) criterion to compare the three reg¬ 
istration methods for each dataset as advocated by Srivistava, et.al. [2011]. The Sobolev Least 
Squares criterion compares the total cross-sectional variance of the first derivatives of the registered 
functions to that of the original functions. Explicitly, 

, Eli Rx'iiHt)) - jf EjLi x 'j(hj(t))) 2 dt 
sis = - Tf --- -Tf -. 14 

Lower values of sis correspond to better function alignment. 

Simulated Data Set Figure 1 contains the functions of a simulated data set. These data 
consist of 20 unregistered scaled mixtures of three Gaussian probability density functions. All 
three registration procedures result in similar alignments. However, the GP method does a better 
job of recovering the original shape of the functions and results in the lowest sis. Note: The ME 
registered functions are based on 5 complete runs of the ME algorithm where in each run the 
previous runs results were used as the ‘unregistered’ functions. 

Berkeley Boys Growth Velocity Data Figure 2 contains 39 velocity of growth functions 
for boys from the Berkeley Growth Study, Tuddenham and Snyder [1954]. For this analysis, the 
original data are slightly changed to eliminate some erratic behavior at the beginning of each 
function. Here, GP and F-R yield similar registration results. However, the GP algorithm results 
in the lowest sis. ME registers the most significant peak in growth velocity but does not align lesser 
features as well as GP. Note: The ME registered functions are based on 2 complete runs of the ME 
algorithm where in each run the previous runs results were used as the ‘unregistered’ functions. 
Running this algorithm more than twice resulted in function distortion due to over-warping and 
a larger sis. 

While the GP and F-R methods result in a similar alignment of functions, these results are 
achieved in very different environments that are specialized to satisfy specific inferential prefer¬ 
ences. The F-R registration method is convenient (using R package ‘fdasrvf’) and provides fast 
high-quality estimates. On the other hand, while providing comparable registration results, our 
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method expands inferential capability by providing 1) variability estimates for all unknown pa¬ 
rameters and 2) a probability framework in which future partially observed unregistered functions 
are considered. In contrast to traditional functional prediction methods, our model not only pro¬ 
vides an estimate of the complete unregistered function, but also estimates the complete warping 
function and the complete registered function. Details of the prediction model are found in Section 
5. 

4.2 Comparison to MCMC Results 

To establish the utility of the adapted variational Bayes algorithm, here we compare the estimates 
of registered functions using adapted variational Bayes versus those obtained through MCMC 
sampling. For this exposition, the simulated data set and the Boys Growth Velocity data set 
described in Section 4.1 are used to look at the discrepancies between the estimated registered 
functions from MCMC sampling versus those determined by the AVB algorithm. 

The squared L 2 norm of the difference between the AVB and MCMC estimate of a registered 
function is used to quantify the differences between these estimates. Figure 3 illustrates for the 
simulated data set how closely the AVB estimates follow the MCMC estimates. Even the largest 
squared L 2 norms of the differences between these two estimates correspond to minor changes in 
the estimates. These simulated data represent rather ideal conditions for registration where there 
is almost no variation in the registered functions beyond a scaling and vertical shift of the target 
function. Consequently, as we might expect, the MCMC and AVB estimation procedures are 
primarily in agreement. Figure 4 is a more realistic look at the differences between the MCMC 
and AVB registration results for data that has significant variation in the registered functions 
beyond a scaling and vertical shift of the target function. However, even here we see the AVB 
algorithm performs well. Of the 39 observations, in only 2 or 3 are there notable discrepancies 
between the AVB and MCMC estimated registered functions. 

These examples show that while ideally AVB estimates are used to initialize a MCMC sampler, 
they often deviate only in minor ways from the posterior mean estimates obtained from MCMC 
sampling. The next logical step is to compare the variance of the approximated posterior distri¬ 
butions to that present in the MCMC samples. The AVB algorithm does not include approximate 
posterior distributions for the base functions or the registered functions. However, we can compare 
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Figure 2: Registered Boys Growth Velocity. Top Left Original unregistered boys velocity data 
functions. Top Right Boys velocity functions registered using the minimum eigenvalue criteria 
(R package ’fda’). Lower Left Boys velocity functions registered by F-R (R package ’fdasrvf’). 
Lower Right Boys velocity functions registered by the GP model. 

the estimated credible intervals obtained through AVB and MCMC sampling respectively for the 
target function. Here we will provide this comparison for the target functions associated with 
1) the original (noiseless) Berkeley Boys Growth Velocity data and 2) the Berkeley Boys Growth 
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Velocity data corrupted with Gaussian noise (see Appendix C.3 for more information on these 
data). 

Generally the concern with using variational Bayes to approximate the posterior distributions 
is that variability is underestimated in the approximated posteriors, ? and Bishop [2006]. In the 
appendix, Figure C.l contains credible bands determined through MCMC sampling for two of 
the registered functions from the noiseless growth velocity data. As can be seen in this figure, 
for noiseless observations, once the registration parameters have been set, the result of specifying 
a highly informative prior on the registered functions is that most of the variability from the 
mean is eliminated resulting in very narrow credible bands. Of course, if the credible bands 
associated with the registered functions are narrow, it makes sense that credible bands for the 
target function are even narrower. Thus, for the original boys growth velocity data, very little 
variability is exhibited in the MCMC samples of the target function and the differences between 
the estimated 95% credible band determined from the q distribution of the target function and 
that determined from the quantiles of the posterior MCMC sample are so small that on a graph 
they are indistinguishable. In the plot on the left in Figure 5 is a comparison of the width of the 
respective credible intervals for the target function at each time point. Here it can be seen that 
not only are the credible intervals very narrow, but there are only minor differences in the width 
of the credible intervals determined from the approximate posterior distribution (AVB) and the 
MCMC sample. 

As can be seen in Figure C.2 of the appendix, if noisy data are recorded, the variability due 
to the noise process results in wider credible intervals for the registered functions (and hence the 
target function). On the right hand side of Figure 5 is a comparison of the width of the respective 
credible intervals for the target function at each time point for the model that both smooths and 
registers the noisy boys growth velocity data. In this illustration, we can see the variability present 
in the posterior MCMC sample for the target function is not captured well by the q distribution 
associated with the target function. This is not surprising and is an example of when performing 
MCMC sampling with an AVB initialization may be preferred to using AVB alone. Note: for these 
analyses both samplers were initialized with AVB estimates which made burn-in unnecessary. Post 
sampling analysis showed low autocorrelation in the final 999 samples of the target function (after 
thinning) for both the noiseless and noisy data. 
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Simulation 2 - MCMC Versus AVB 
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Figure 3: Simulated Data - Difference Between MCMC and AVB Estimates. Left Plot of the 
squared L 2 norm of the difference between the MCMC and AVB estimates for each observation 
in decreasing order of magnitude. Center and Right The original unregistered function plotted 
with the MCMC and AVB estimates of the registered functions for the observations with the two 
largest discrepancies between the MCMC and AVB estimates. 

5 VARIATIONAL APPROXIMATION FOR FUNCTIONAL 
PREDICTION 

5.1 Functional Prediction Algorithm 

The probabilistic framework of our registration model provides a natural structure in which we 
can consider new observations. Functional prediction has previously been considered by Ferraty 
and Vieu [2006]. Here we extend current methods by taking into account the phase variability of 
a partially observed function. 

We will make the following assumptions. 

1. We have a sample of approximated unregistered functions, X* = (Xj(ti),..., Xi(t p ))', i = 

2. X,; = ..., Xi(t p )y, i = 1,..., N are registered using the registration method outlined 

in Section 2 via a MCMC sampler or adapted variational Bayes. 

3. From (2) we have obtained estimates for the target function, /(£), the registered functions, 
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Figure 4: Registered Boys Growth Velocity - Differences Between MCMC and AVB Estimates. 
Top Left Plot of the squared L 2 norm of the difference between the MCMC and AVB estimates 
for each observation in decreasing order of magnitude . Top Center and Right The original 
unregistered function plotted with the MCMC and AVB estimates of the registered functions for 
the observations with the first two largest discrepancies between the MCMC and AVB estimates. 
Lower Plots of the next three observations with the highest squared L 2 norms of the difference 
between the MCMC and AVB estimates. The squared L 2 norm associated with the lower right 
plot is about .64. As can be seen in this illustration, at this level there are only small differences 
between the MCMC and AVB estimates. 

Xi(hi(t)), i — 1,..., N, the warping functions, hi(wi(t)), i — 1,..., N, a 2 0 , and cr^. 

4. A new function, X N+ i(t) has been observed at the time points (G,... ,t r )', r < p. 

5. (X(h(ti)),.. . X(h(t p )))' ~ A r p (/i X(h ), £ X (h)), the distribution of the registered functions can 
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Figure 5: Pointwise Comparison of Credible Interval Width for the Target Function. Both plots 
contain the width of the credible interval based on the approximate posterior distribution for the 
target function obtained via AVB (triangles) and the width of the estimated credible intervals 
determined by the empirical quantiles of the posterior sample of the target function obtained 
through MCMC sampling (circles) at each time point . Left This is a comparison of the credible 
intervals for the target function when the data are recorded without noise. For most time points 
in this plot, the width of the credible interval at that time point using the approximate posterior is 
almost identical to that obtained through MCMC sampling. The biggest difference can be seen at 
ages 14-16 where the MCMC credible interval is slightly wider. Right This is a comparison of the 
credible intervals for the target function when the data are recorded with noise. Here it can be seen 
that the approximated posterior distribution of the target function significantly underrepresents 
the variability in the true posterior distribution. 


21 















be approximated by a multivariate normal distribution using the sample mean, Ax( h )> and 
sample covariance matrix, £x(h), of the estimated registered functions obtained in (2). 

6 . (w(ti),... w(tp-i))' ~ iV p _i(/E4 w , X w ), the distribution of the base functions can be approx¬ 
imated by a multivariate normal distribution using the sample mean, /i w , and sample co- 
variance matrix, X w , of the estimated base functions obtained in (2). 

Under these assumptions, we will proceed as follows. 

1. Register the partially observed function, X P at+i = (X N+ i(ti),..., X N+ i(t r ))' to the esti¬ 
mated target function, /(f), truncated to an appropriate registration time, tf, f G {1 ,... ,p}, 
so that h,N+i(tf) = t r . 

2. Using the distributions from assumptions (5) and (6) above, the estimate of the partial reg¬ 
istered function, X P Ar+i(hAr+i) = (X^ +1 (fiN+i(ti)), ■ ■ ■, X/) +1 (/pv+i(t/))y and the estimate 
of the partial base function, w p at + i = (w p jv+i(ti),... w P N+i(tf-i))', estimate the registered 
and base functions to time t p and f p _i respectively using the conditional expectation of the 
multivariate normal distribution. Accordingly, denoting future registered observations and 
future warping function values , X F 7 v+i(hjv+i) and w F at+i, respectively, the estimates of 
these future values are 

xVi(fhv +1 ) = £;(X F (h)|X p ^ +1 (h iV+ i), A x(h ),Ex(h)) and 
w f jv+i = U(w f |w p at + i, £w,E w ). 

3. Estimate the complete unregistered function, X N+ i (f), using the inverse of the estimated 
warping function and the estimated registered function. 

5.2 Determining the Last Registered Time 

An additional random element in the prediction model is the last registered time of the truncated 
target function, tf, used to register the partial observation. To obtain the best possible registration 
of the partial observation, a range of final registration times are considered over a finer domain. 
The efficiency of the adapted variational Bayes algorithm makes it possible to consider several 
possible partial registrations as follows. 
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1. For each of the time points tj,j G {m ,,.., (m + k — 1)}, t m+ k -1 < t p , the partially observed 
function, X^ +1 {t), is registered to the estimated target function, /(f), truncated to time, tj, 
so that fiN+i{j)(tj) = t r , where /ijv+i(j)(f) is the estimated warping function determined by 
registering X^ +1 {t) to the proposed final registration time tj. Note, the hrst and last times 
considered in this interval are chosen by plotting the partial unregistered function and the 
target function together and determining a generous interval that contains the appropriate 
hnal registration time. This interval is subsequently made finer to allow this time to fall 
between two of the original time points. 

2. Calculate d t] = ||X P jv+i — (^o(j)l + D(j)f U (j))1 2 for each tj, j G {m,... ,m + k — 1} where 

fU (j) = • • •, ICh^+i^itr) = tj))'. 

3. Set tf — argmin d tj . 

This algorithm determines the hnal registered time, tf, that results in the minimum L 2 norm 
between the partially recorded unregistered function and the target function evaluated at the 
inverse of the warping function estimated using that hnal time. Note, for all j, f U (p shares the 
same domain as the partially recorded unregistered function, X P Ar + i. 

5.3 Confidence Intervals for Predicted Functions 

The efficiency of adapted variational Bayes for prediction also makes it possible to characterize 
variability in the estimates of the complete registered function, unregistered function, and base 
function via bootstrapping. To capture variability in the predicted functions, for each of m = 
1,..., M iterations, perform the following steps. 

1. Draw a new sample of N registered functions from N p (p,^^, £ X (h))- 

2. Draw a new sample of N base functions from N p _i(£i w , S w ). 

3. From the bootstrapped samples determined in (1) and (2), compute the sample mean and 

covariance matrix for the approximated registered functions, an( l ^x(h) an( l also f° r 

the approximated base functions, and \ 


23 



4. Register the partially observed function as in prediction step (1) in Section 5.1 above by 
setting the approximated target function, f, equal to /^xjh) • 

5. Draw a sample of size S from the distribution of X F (h)|X P 7 v+i(h^ 1 ), Ax(h)> ^x(h)- Com¬ 
bine each of these samples of future values of the registered function with the estimated 
partially registered function determined in (4) to get a sample of S estimated registered 
functions. 

6 . Draw a sample of size S from the distribution of w F |w p ^ +1 , \ Combine each of 

these samples of future values of the base function with the estimated partial base function 
determined in (4) to get a sample of S estimated base functions. 

7. Determine the S warping functions that result from step (6). 

8 . Determine the S unregistered functions that result from combining each of the S registered 
functions drawn in (5) with the corresponding inverse warping function from (7). 

This process results in MxS bootstrapped samples of the registered function, warping function 
and unregistered function. From these samples, point wise bootstrapped confidence intervals can 
be determined for each function. 

Here we have approximated the distributions of the base and registered functions by fitting 
a multivariate normal distribution. However, given the small sample size, approximating these 
distributions by a multivariate t distribution as suggested in Lange [1989] may provide confidence 
intervals with better coverage properties. Beran [1990] also provides a method for constructing 
robust confidence intervals in the context of univariate prediction problems. 

5.4 Functional Prediction - El-Nino Data 

The El-Nino data consist of weekly readings of sea surface temperature with the first observation 
in June of 1950. Complete data can be found at NOAA’s Climate Prediction Center website 
(http://www.cpc.ncep.noaa.gov/data/indices/). The data that we are using for this analysis are 
found through Professor Frederic Ferraty’s (Mathematics; University of Toulouse, France) web¬ 
site (http://www.matli.univ-toulouse.fr/ ferraty/ SOFTWARES/NPFDA/npfda-datasets.html). 
These data are a subset of the original data with monthly sea surface temperature records from 
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June of 1950 to May of 2004. For this analysis, the bi-monthly observations are added to the data 
to prevent significant changes to the shape of a given function due to interpolation error. Also, 
light smoothing is applied to all functions. 

The goal of onr study is to predict how high temperatures will stay in the remaining part of the 
year in conjunction with how long temperatures will drop before they rise again based on the first 
seven months of temperature recordings from the lowest temperature recording in the previous 
year. 

For this purpose, the data are restructured to define a “year” as the period of time between 
the lowest temperatures in consecutive calendar years. For example, the first year in our data set 
ranged from September 1950 to September 1951. Note, these “years” will not all be 12 months in 
length, and our final data had “years” that ranged from 11 to 14 months. For our analysis, we will 
concentrate on a subset of this group of restructured functions where the previous year’s lowest 
temperature for each selected temperature profile is between 19.5 °C and 21 °C. Twenty-nine of 
the original temperature profiles fit this criterion. We will use the first 28 functions to predict 
the remaining portion of the 29th function based on the first 7 months of sea surface temperature 
observed in that year. 

For the purpose of registration, all functions need to be recorded over the same interval of time. 
As mentioned above, in this particular case our data is recorded over a time periods that range 
from 11 to 14 months. An easy remedy to this situation is to perform a simple initial warping to 
each function that rescales every observation to an 11 month time frame. In our final analysis, this 
initial warping is accounted for when determining the final base functions used for the prediction 
algorithm. 

The original unregistered functions and the functions registered using the GP model described 
in Section 2 are plotted in Figure 6. For this data set, to register significant features in the 
sample while retaining function variation beyond a scaling and vertical shift of the target function, 
individual warping parameters, r 'f Wi , i — 1,... ,28 were utilized instead of in (23). Significant 
differences in the amplitude variation in the original functions that is unassociated with temporal 
variation prevented the use of a global parameter. However, only 3 unique warping parameters in 
total were necessary. 

Using the empirical mean of the 28 original registered functions as the target function, the first 
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7 months of sea surface temperature records from observation 29 are registered to a piece of the 
target function where the final registered time is allowed to vary from 6.5 to 7.5 months. Between 
these months, a finer time interval corresponding to weekly records is used to allow for additional 
flexibility in determining the final registered time. The partially recorded function is plotted with 
the target function in the lower right panel of Figure 6. The grey shaded area includes the time 
points considered for the final time of the partial registration. After the optimal registration of the 
partially recorded observation is determined, estimates of the entire registered function, warping 
function, and unregistered function are determined using the model outlined in Section 5.1. 

One-hundred bootstrapped samples were used to estimate the variability in the predictions 
of all three estimated functions. Figure 7 plots the initial estimates with the 95% bootstrapped 
confidence intervals. In addition, the plot of the estimated unregistered function also includes the 
true value of this function. 

The primary advantage of registering the partially recorded observation before estimating 
future values is that we can capture variation in amplitude and timing separately. In Figure 7, 
the first plot captures the variability in the future level of sea surface temperature (amplitude 
variation), and answers the question,“How high can we expect sea surface temperatures to stay?”. 
The second plot captures the variability in the timing of future observations (temporal variation) 
which addresses the question of, “When can we expect sea level temperatures to begin rising 
again?”. The confidence interval for the unregistered function seen in the last frame of Figure 7, 
combines both amplitude and temporal variation to estimate the future trajectory of sea surface 
temperature for this year. In this illustration it can be seen that the main difference in the 
estimated and actual temperature profiles lies in the timing of the lowest observation. However, 
for this observation, the sea surface temperature at 12 months is not much different than the sea 
surface temperature at 11.5 months. The predicted timing of the lowest temperature was 11.1 
months. 

One of the most notable features of this analysis is that there is little uncertainty in the 
registration of the first 7 months of sea surface temperatures. The most prominent feature in the 
data is the peak temperature that occurs anywhere from 4 to 8 months in the original data. In 
our partially recorded observation, as seen in Figure 6, the peak of the target function and the 
partially recorded observation are already closely aligned. Additionally, this observation happens 
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El-nino - Original Functions 


El-Nino - Warping Functions 





Figure 6: El-nino Data. Top Left Original 28 profiles of sea surface temperature. Top Right 
Estimated warping functions. As can be seen here, the time period of the original data ranged 
from 11 to 14 months. Lower Left Estimated registered temperature profiles. Lower Right 
The solid line is observation 29 recorded for 7 months. The dashed line is the estimated target 
function. The grey shaded area spans the 5 time points that are considered for the final time of 
the partial registration. 
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Estimated Registered Function with 95% Bootstrapped Cl 


Estimated Warping Function with 95% Bootstrapped Cl 


Predicted Function with 95% Bootstrapped Cl 





Figure 7: Estimates and Bootstrapped Confidence Intervals. Left Estimated registered function 
with 95% bootstrapped confidence interval. Note: In parts of the domain, the estimated confidence 
interval and estimated registered function overlap. This is largely due to a bimodal distribution of 
the last registered time. Center Estimated warping function with 95% bootstrapped confidence 
interval. Right Estimated unregistered function with 95% bootstrapped confidence interval. The 
dashed and dotted line is the true unregistered function. 

to be similar in shape to the target function. The combination of these features resulted in only 
a minimal amount of variation in the estimated registered and warping functions in the first 7 
months. However, we note here, this phenomenon is an artifact of these particular data, and in 
other analyses more variation in the registered timing of the partially recorded observation would 
be expected. 

The El-nino data set provides a challenging registration problem. The registered functions 
vary significantly in directions beyond the target function. Choosing curve specific registration 
parameters enabled features common to all functions to be registered while retaining prominent 
features in each individual curve. This is just one example of the difficulties that can arise in 
registering functional data and in turn how these challenges can be addressed to analyze data that 
does not fit the “ideal” registration problem. 






6 DISCUSSION 


In this paper we have developed a methodology for Bayesian registration that accounts for uncer¬ 
tainty in the registered functions, the warping functions, and the parameters associated with them. 
The hierarchical structure of this model allows multiple inferential procedures to be included in 
one analysis. We give an example where functional regularization and registration are performed 
in one model. However, these models will accommodate any combination of inferential procedures 
for which an appropriate prior can be defined. For instance, the models proposed here can easily 
be extended to a functional linear regression model where the registered functions or registered 
latent functions are considered as covariates. 

While our registration algorithm provides high quality estimates in a highly flexible model, the 
associated computational costs due to running a MCMC sampling scheme for a high-dimensional 
model are considerable. To address these costs, we have proposed the Adapted Variational Bayes 
algorithm. This algorithm has been shown to converge in a similar way to traditional variational 
Bayes, but may not converge to a global maximum. However, an initial warping can be performed 
to move a function closer to its optimal registered value if the algorithm converges to a local 
maximum. 

The bijective relationship between the base and warping functions in this model makes it 
possible to perform functional prediction in the context of registration. Using the estimated 
values of the base and registered functions from an initial registration of N functions, we use 
approximate distributions for these functions to predict future values of the registered, warping, 
and unregistered functions of a new function that is only partially observed. Furthermore, the 
AVB algorithm makes it possible to re-sample from these distributions to bootstrap confidence 
intervals for these predicted values. 

While the AVB algorithm provides estimates that are similar to their MCMC counterparts, 
for some models MCMC sampling remains the optimal inferential procedure. For example, in the 
model that combines smoothing and registration AVB estimates are approximate and preferably 
are only used to initialize an MCMC sampler. While we have used Metropolis within Gibbs 
samplers, these are likely not optimal. Determining the most efficient method of sampling from 
these joint posterior distributions is left for future work. 

For simplicity, we have used inverse-Gamma or Gamma priors for the variance components in 
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these models. The best choice of priors for these components is a uniform prior on the square 
root of the variance components as suggested by Gelman [2006]. This is particularly a problem 
when the variance component is small or there is very little data to estimate these components. 
For the models presented here, Gamma priors for the smoothing parameters are sufficient as small 
changes in these parameters do not significantly affect the model. However, in the analysis of the 
noisy boys’ growth velocity data in Appendix C, we saw some evidence that the “uninformative” 
inverse-Gamma prior resulted in a slightly upward biased estimate of the noise variance. 

Modeling functions as Gaussian processes in a Bayesian hierarchical model offers a unified 
approach to performing multiple inference procedures for functional data within one model. In 
Earls and Hooker [2014], we established the properties of functional data estimates determined by 
approximating functional distributions over a finite subset of observed time points and estimating 
functions at unobserved time points with linear interpolation. Furthermore, we demonstrated that 
providing smoothing information in scale or covariance matrices results in regularized functional 
estimates. Here we extend this work by using informative covariance matrices to register functions 
where the registered functions are modeled as Gaussian processes. Future work includes adapting 
these models for other areas of inference for functional data and continuing to improve on the 
models we have proposed here. 


A 

Below, in detail, are the specifications for the hierarchical Bayesian registration model discussed 
in this paper. The first section includes the basic model for functional data registration also found 
in Section 2. Section A.2 describes the MCMC sampling scheme for this model. 


A.l Functional Registration 

As discussed in Section 2, the initial assumption of this model is that we are interested in regis¬ 
tering functional data, Xi(t), i — 1,..., N, where these data are either observed directly over a set 
of time points, t = (t\ ... t p )', or are estimated from noisy observations, Y (Yj(fi), ..., Yi(t p ))'. 
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We assume a Gaussian noise process such that for each observation Yj, 
f(Yi(tj) | Xi(tj),a 2 ) = N a 2 ) for i = 1,..., N, j = 

which results in the joint distribution of all observations, 

/(Y|x,(7 2 )=nL^p(Xi,a 2 J), 

where Y is the matrix such that the observation for function X,(t) at time point tj is in the 
ith row and the jth column, X is the matrix of the corresponding means for each entry in Y, 
and Xj = ( Xi (iy),..., Xi(t p ))', the vector of evaluations of the functions X t (t) at time points 
t = (ti,...,t p y. 

When the observations are observed noisily, the registered functions and noise variance are 
characterized by the following prior distributions: 

Xj(hi) | Zoi,Zii,f,7fl> A x ~ N p (z 0i l +z li i,'y^ 1 'E + 'Ex), i — (15) 

Sx = Vx lp i + -^v lp 2 , and 
c Jy ~ IG(a,b). 

However, If each function Xi(t) is observed directly over t, (15) assumes the roll of the distri¬ 
bution of the observed data and the covariance matrix, Ex, designed to penalize roughness in the 
unregistered functions is excluded. This results in the following data distribution, 


Xj(hj) | z 0i , zu, f, 7 R , X x ~ N p (z 0i l + z u f, y^ 1 ^), i = 1... N. 


( 16 ) 
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In both scenarios, we assume the following additional priors: 
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where a, b, c, and d are fixed hyperparameters. 


X is a fixed matrix designed to penalize variation in any direction from the corresponding mean 
of the distribution in which it is utilized. It is composed of two matrices, Pi and P 2 , such that 
X = Pi + P 2 . Pi penalizes variation from the mean in constant and linear directions, and P 2 
penalizes variation from the mean in directions of curvature. For the distribution on the registered 
functions, X penalizes variation from a vertical shift and scaling of the target function. In the 
distribution of the base functions, X penalizes variation from the identity warping. The fixed 
parameters 7 # and Vw determine the degree of these penalties for the registered functions and the 
base functions, respectively. 

P 2 is also used to penalize curvature in the registered functions, base functions, and the target 
function with associated smoothing parameters Ax, A w , and A/. Further details of the construction 
of Pi and P 2 are found in Earls and Hooker [2014], Also note, in the prior on the base functions 
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(17), we are allowing P 2 to be more generally interpreted as a penalty on severe deformations of 
the unregistered functions. Here, P 2 may be either a penalty on the squared second derivative of 
the base functions (as above) or a penalty on the squared first derivative of the base functions. 

A.2 MCMC Sampling 

Using these assumptions, the following full conditional distributions are derived to run a MCMC 
sampler: 
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X, I rest ~ N p {(<j~ 2 I p + E* 1 )- 1 ^ 2 !^ + E4(^1 p + (hr 1 )), (a;% + E* 1 )' 1 ), 

f | rest ~ A^p(/i. f |, rest , Sf| res t), 

N 

Sf|« sl = E 4(7^2+ S X )- 1 + S7 1 )- 1 , 

2=1 

N 

Mf|rest ^f|rest((T r X 'b X v) ^ ] -^li( X i(hj) Zojlp)), 

i=l 

N 

4 | rest ~ JG(a + Yp/2, 6 + 1/2 ^(Y* - X*)'^ - X,)), 

2=1 

N 

rjx | rest ~ G(c + N, d+ 1/2 ^ tr((Xj(hj) - (z 0i l p + zijf))(Xi(hj) - (z 0i lp + z^f))'^), 

2=1 

N 

A.y | rest ~ G(c + iV,d+ l/2y^tr((Xj(hj) - (z 0i l p + zi i f))(X i (h i ) - (z 0i l p + 

2=1 

Zq i I rest ~ -^(/hsoihest) Cr zo;|r-est)> 

a loi\rest = ( a zo + ^ + Sx) X lp) \ 

JV—1 

l^zoilrest. 4 0i |rest( X i( h i) - X nOn) + (* 1JV “ *li)f ~ ^ Z OjHj ± ^p)'(l + E X ) %), 

3 = 1 

N- 1 

4 0 I re st ~ JG(a +(Y-l)/2,6 +1/2^4), 

2=1 

^1* I re,s ^ ~ X (hzii|r-est> 

< 4 |«* = (^ 2 + f / (7fl 1 S + Ex)- 1 f)- 1 , 

Vzulrest = 4i;|rest(( X i(hi) - + S.y) X f), 

N 

a 2 Zl \rest ~ IG(a + Y/ 2, & + 1/2 ^ 4); 

2=1 

rjf | rest ~ G(c + 1, d + (f'Pj"f)/2), and 
A/ | rest ~ G(c + (p — 2)/2, d + (f'P^ f)/2). 

Note, this list does not include a full conditional for the base functions or registered functions 
as their priors are not conjugate. Instead, the base and registered functions are sampled via a 
Metropolis step. 
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A.3 Forcing Registered Time to Equal Standard Time 

1. If the functions are registered so that registered features occur at the average time that they 
appear in the unregistered sample, for all t, t G {U,... ,t p }, the average warping at that 
time point, h.(t) = Y2iLi K(t), is the identity. Over all observed time points this implies 
h. = (h.(t i) = ti,..., h.(t p ) = t p )'. 

2. Generally, after the registration process, this property will not hold, and instead h. = 
(h.(ti) — ti, ..., h.{t p ) = t p )' where tj ^ tj for at least one j G {1,... ,p}. 

3. The goal is to shift the functions so that these average warpings correspond to the correct 
registered times, i.e. h. = (h.(ti) = t \,..., h.{t p ) = t p )'. 

4. (3) implies that after the initial registration, we have the correct registered function values 
over the new set of times, t = (ti,... ,t p )', i.e. we have estimates of 

X,;(h,;) = (Xiihiih )),..., Xi{hi(i p ))y, for all i = l,...,N. 

5. If it is desired, the estimated values of the registered functions at the original time points, 
t, can be obtained by interpolating values between the new set of time points, t. 

For example, after the initial registration process, suppose h.( 2) = 2.25 and h.( 3) = 3.1, where 
2 and 3 are in the set of original time points. We can alter registered time so that /x.(2.25) = 2.25 
and h.( 3.1) = 3.1, as desired. Using the notation above, this implies {2.25,3.1} C t and for all 
i,i = 1 ..., N, from the initial registration, we estimated the following values, Xi(hi( 2.25)) and 
Xi(hi( 3.1)). From these we can estimate X t (h t (3)) by interpolating the values Xi(hi( 2.25)) and 
Xi(hi( 3.1)). 


B 

B.l Adapted Variational Bayes 

Below are the approximate posterior distributions for the adapted variational Bayes estimation 
procedure outlined in Section 3.1. 
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Based on the full conditional distributions found in Appendix A.2, the following approximate 
posterior distributions are updated in each iteration for the registration model that accounts for 
noisy observations: 


?(Xi) 


-4(A t q(X i )) ^g(Xi))) 

9(f) 



9(4) 



q(vx) 


G \ C <l{Vx)i dq(r) X ))i 

Q(^x) 

rv_/ 

G ( c q(X x )i dq{ Ax))> 

q(zoi) 

rs_/ 

N{n q (z 0i )i a q(z 0i )) 

9(4 0 ) 

rv_/ 

^ G i a q(cr^ 0 ), b q ( a | q )), 

9(-n) 

rv_/ 

N(^q(zu), <J q(zu))i 

9(4) 

rv_/ 

IG{aq(oaj, b q ( a 2^), 

q(vf) 

rv_/ 

G {cq(rif)i dq(j 7 /))> and 

9(A/) 

rv_/ 

G i C q( A/); dq^Xf))- 


If the observations are recorded without noise, i.e. we have observations Xj, i = 1... IV as 
described in (19), the following q distributions are omitted, 

?(Xj) for i = 1... N, q((Jy), q(vx), and q(X x ). 

As the q densities are all of known distributional forms, updating these densities is equivalent to 
updating their parameters. First, assuming the data are recorded without noise, for each iteration, 
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the following parameters are updated for the q densities listed above: 


£g(f) 


^9(f) 


[E(°q(*ih + 1 + ^(r/f) P l + A t <?(A f )P 2 > 

2=1 


N 

^<?(f)7(Xj(hj) — , 

i=l 


(7 


2 

g(zod 


dq{z 0 i) 


K«) + 21PbS- 1 1 p )- 1 , 

JV-1 

^(zoi) (-^*(hi) — Xjv(hjy) + (h<?(ziiv) — /7?(zii))/h?(f) — ^ hv(zoj)l{* 7^ j}lp) 

J=1 


1 lp, 


^(zh) = (/ i 9«)+ tr (( S 9(f) +^(f)^(f))7i?S *)) 1 , 

/^?(zii) = ~b ^q(f)lR^ (Xj(hj) ^q(zoi) lp) ) > 

dq(m) = d + 1/2* tr(P x (S, (f) + ^(f)^(f))), 
d ?( A f ) = d + 1/2 * tr(Pj(S, {f) + ^(f)^(f))), 

iV-1 

6 <?(-2 0 ) = b +V 2 E(^od+/4od)< and 
2=1 
N 

& <?«) = & + 1/2 EWziO + ^(zi<))' 

2=1 

Note, these updates are listed in an order that allows the convergence criterion to be calculated. 
Further details on the convergence criterion can be found in the next section. 

For the model where observations are recorded with noise, in all of the updates above Xj(hj) 
and Xjv(hjy) are replaced by /tt 9 ( Xi (hi)) an d A t g (x Ar (h Ar ))) respectively. Additionally, y^X” 1 is re¬ 
placed by (y^X + Sx)' 1 . For each i, /r 9( - X;(h )) is determined by using the update for the mean of 
the q distribution of the unregistered function, /Lt ? ^ Xi ) below, and registering it using the current 
value of hj. In addition to these modified updates, the following additional updates necessary: 
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Sg(Xi) — (/^ 9 (^-)Ip + Hqiilx)^ 3 1 + h<?(Ax)f*2 ) J 

°Y 

Mq(Xj) = ^9(Xj) [/^g(^_)Yj + (/Aj(77x)Pi + /MAx)^ )(/^9(zoi)lp + A f q(^ii)-^'(0-x i ) (^h )])]) 


iV 




6 + 9 E Y *' Y * - 2 K(X ! )Y, + E S «Lb j] + ^( Xj )[i] 2 ), 


i=l 


J=1 


1 ^ 

d <?(vx) = d + 2^ r [5Z (^?( x i) + Mg(x,)A / ' (? (Xi) — 2//. ? p c ,)(/x (? ( 20i )lp + / i (?(zi i )-E'(0- ?7A .)[f(hj )]) 

i=l 


K(* 0i ) ^g(20i))^P^P ^A t 9(«0<)A t 9(2l»)^P-^'(®-»7x)[^(^ :1 * )] 

OEd + / i ?(*n))- E '(«-^)[ f ( h r 1 ) f ( h r 1 ) , ]) p r |«. and 

i 


i N 

d q( Ax) = d + 2 tr [E ( S< ?( x d + M 9 (x i )K(x i ) - 2 M g (x i )(^( 2 0i ) 1 p + 1 )]) / 

2=1 


K(*h) ^ g (ZOi)) d P^P ^ 2/Xg(^ 0i )/X <3 ( 2 ;i i ) lp-ZE7(6»_ A ^ ) [f (h.^ )] 

K%„) + ^ 1< ))-E , «.-w)[F(h,- 1 )f(h- , )'])PJ 


Note, these updates contain terms that cannot be evaluated. For instance, F l ( 6 )_ ]?if )[f(hb 1 )f(h~ 1 ) / ] 
cannot be determined because the approximate distribution of f(h“ 1 ) is unknown. These terms 
can however be approximated. Appendix C.2 provides details of the approximated values used for 
this analysis. 


B.2 Convergence Criterion 

When the functional observations, X = X*, i... N, are recorded without noise, 

E q (e_^) [ log / (X, w, 0_ w ) - logg(6>_ w )] is monitored until the desired threshhold is met, where 
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Eq(g_ m ) [ log /(X, w, 0_ w ) - log q(6- w)] = E q p_ v ) [ log(/(X, w | 0_ w )/(0-w)) - log g(0_ w )] 

= E q(g _ m) [ log /(X, w | 0_ w ) + log /(0_ w ) - log q(0 - w )] 
= £, ( *_ w) [log/(X,w|0_ w )] 

+ E q{e _ m) [log/(f) - logg(f)] 

(N—l) 

+ E ^-w)[log/(-o*) -logg(-oi)] 

(iV) 

+ E ^g(O-w) [ log f(zii) - log q(zii)\ 

i— 1 

+ Eq{ 0 _„) [ log /(<7~ 0 ) - l°g?(^o)] 

+ ^(0-w)[l°g/«) - l°g?«)] 

+ £*(0-w) [log f(v f ) - log q(v f )\ 

+ E q( 0_ w) [log/(A/) - log q(Xf)] ■ 


Now looking at each piece individually, 

^_ w) [log/(X,w|0_ w )] 

N 

= £«<*-.) [X (>og[(2 3 r)-<’ /2 I T ys |- 1/2 ])_ 

2=1 

JV 

+ -Eg(0_ w ) __ (^Ojlp + 1 (Xj(hj) — (^o*lp + ^ljf))] 

2=1 
N 

= (log[(2vr)“ p/2 | 7 r 1 £ |“ 1/2 ]) 

2=1 


+E 


(Xj(h i ) / 7 R S _1 Xj(hj)) - 


2X, : (h J ;) / 7 i? S 1 n q{zoi) l p -2X i (h i y-f R '£ 1 fi q(zu) fj lq{f) + 

( a g(z u ) + ^(zii)) tr ((^9(f) + M 9 (f)Mg(f))7R^ *) + 
2/ i <j(zoi)/ i 9 (zii)-l-pA'R^ A 4 9 (f) — 


IV-1 


IV-1 JV-l 


[ E(^od ^q(z 0i )) + ;ee ^g(zoi)^q(zoj )!{/ lp> 

2=1 2=1 7 = 1 
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E q{ 0 - w) [ log /(f) - log g(f)] = E q{e _ w) - | log 27r + ^ log I T)fP 1 + A/P.: 

E q(0-w) — (tr [fF'(?7/P x +A/P 2 )] + 
rp 1 i 

E q(0- m ) !og 2vr + - log I S «( f ) I J + 


E q(0-™) 2 tr ( ff/S g(f)) _ f/S g (f)AVf) + 

E q(0—w) ~^l J 'q{f) E ‘q(f)^q(f) 

= c+ ^E q(0 _ v) [2 log rjf] + ^E q( e_ m) [(p - 2) log \ f ] - 
-tr((Sg(f) + flq(f)L l ' q (f))(Hq(r 1 f) P 1 + 2 )) _ 

l lo 8 I s ,m I +f • 


where C is a constant that does not change from one iteration to the next. For z 0 = (zoi, • • •, zo(iv-i)) / , 


E q ( 0 - w ) [ log /(z 0 ) - log g(z 0 )] 


For Z! = (z n ,.. .,,2i;vy. 


r, f AT — 1 1 iV — 1 , 2 v-y 1 2 

Eq( 0 . m ) -log 2 tt -— log a Zo - 2^ - —z 0i + 

L z z *=i Z(J ^o 

iV — 1 TV-1 2 

~ 2 ~ log 2vr + —log a q(z0i) + 

N -i x 

77~2 ( z 0 i ~ /h?( 2 o;)) ( 17 ) 

i =i g(^oi) 

TV — 1 o riV — 1 ,1 

— 3 — log ^(*d 0 - e q(0- w) lo S ^ 0 J - ( 18 ) 

1 iV_1 N — 1 

o ^q{~hr) ( Fg(zoi)) ) ^ o • 

z <, zo \ , =1 / z 
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N 


^(0-w)[ lo g/( z i) - logg(zi)] = E q(e _ m ) 


N N t , v-^ 

- — log27r- — \oga Zl - 2^ 


1=1 Z1 


Z li + 


N N 

— log2vr + — logcr 


2 

?(2l) 


N 1 

i=l 9(*l 0 


(^li Eq(zu)) 


N 1 ; 

2 log -^ 9 ( 0 -w) 


' 2 -E n 

N 


N 1 2 

lY og<r 


21 


"* 1 i=l 


N 

+ y 


For 


^9(0-w)[ lo g/(^ o ) _1 °g^( a D] = E 9(6 -w) 


log 


r( a ; 


- (a+ 1) log o-^ -6-5- - 


O' 


20 


log 


h a ^io> 

v* 2 0 ) 


r(a 3(<) ) +(o ’<<’-»> + 1)log<7 » + 


bq(a2 ) o 

v 2 °V, j 


#g(0- w ) [ - (a + !) lo g^ 0 ] “ ^<?(4r) ~ lo S 

(Tz o 


/V-i 0 ) 
b iEI 0 ) 


F( a g(cr|„)) 


+ 


b 

!°g + E q (0_ m ) [(a q{<T 2 o) + 1) log crf o ] + 


For 0^, 


£ g (0_ w )[l°g/( ff2 




“log?^ 


= E. 


9(0-w) 


log r^)- ( “ +1)log<T »- 6 ^- 


21 


log 


h %^\) 
°9(<) 

Fls^lj)) 

, 1 1 


+ ( a g(al 1 ) + 1) l°g ^ + 


= E, 


9(0- 


,)[-(«+!) log - &/',( l ) - ^g 


°9(<) 

F(®g(cr| )) 


+ 


b 

lo § fy^y + F9S0 w ) [(«9(^) + !) log4] + h'li^U'qi^y 
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For r/f, 


E q(o~ w ) [ log f(Vf) ~ log q(Vf)\ 


For A f, 


E, 


?(0-w) 

d 


log 


d c 


+ (c - 1) log rjf - drjf - 


c q(vf) 


log 

log 


r (c 

'• log Ilf - <!,,[, 
d r ' 


1 ^( c q(vf)) 

d c 


log 


q(vf) 


r(c) r(c, (w) ) 

dq(ilf)dq(vf)- 


E q(0-„) [log 7 ?/] - d Eq(v f) + 


E qi g_ w) [logf(X f ) - log<?(A/)] 


E, 


g(0- w ) 


log 


r(c; 


log 

log 


d 


'«(*/) 


r( c g(A / )) 

d C 


d 


+ (c 


p — 2 

2 

c ^/> 


log: 


9( a /) 


r(c) b r(c, (A/) ) 

dq(\f)Pq(\f)- 


- 1) log A/ — d\f — 

+ c — 1^ log A f + dg^x^Xf 

p — 2 

- -^-E q( g_ w) [ log A/] - dn q {Xf) + 


The expression for F' g (e_ w ) [log/(X, w, 0-w) - logg(0_ w )] can be simplified much further by 
combining terms that cancel out. However, in some cases the ability to cancel terms depends on 
the order of the updates. For instance, in the expression, E q (g_ m) [ log /(cr? o ) — log g(cr? 0 )], the terms 

- h l‘ q ( J ) alld h U{<r%)!•.<(■ I ) CanCcl With -IVqtM ( ^KW^od)) fr0m ^-w)[l°g/(Zo)- 

logg(zo)] as long as the parameters of q( z 0 ) are updated before b q ^ a 2 ^. For convenience, we have 
taken account the ordering necessary to compute the convergence criterion in the updates given 
above. Additionally, note all components in this expression that do not change from one iteration 
to the next can be ignored. 
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C.l Functional Data Regularization and Registration 

If instead of the function itself, noisy observations of each unregistered function, X t (t) are ob¬ 
served over a finite number of time points, t = (t\,... ,t p )', we will additionally assume that the 
observations, Yi{tj),j = 1 ,p are iid, N(Xi(tj), cry). Incorporating registration and smoothing 
into a single model has also been considered recently by Raket et. al. [2014], In their paper, each 
registered noisy function, Yfhft )) at time point tj, j — 1,... ,p is composed as follows: 

Y i(hi(tj)) = f(tj) +ri(tj) + ei(tj), 

where fit) is similar to our target function, r t (t) is a function-specific random effect that accounts 
for variation in individual noiseless functions beyond the target function, and e^itf) are iid Gaussian 
noise. 

The advantage of our model is that incorporating individual random effects is unnecessary. 
Noting that the observations are noisy, not the registered functions; smoothing in our model is 
applied to the observations, not to the functions after registration. Under these conditions, vari¬ 
ability in the estimated unregistered, smoothed functions, X t (t), can be looked at separately from 
variability in the estimated registered functions, X,(hj(t)). Appendix C.3 provides an example 
of how treating smoothing as a pre-processing step underestimates variability in the posterior 
distributions of the registered functions. 

In the presence of noisy observations, the following distributions are either altered or added to 
the registration model presented in Section 2, 


Yiitj) | Xi{tj) 

~ N(X i (t j ), 4 ), i = l...N, j = 1,. 

■ ■ ,P, 

(19) 

ij Zu, f, rjx, Ax 

~ N p (zoil + Zijf, T ^x)) i — 1) ■ 


(20) 

£x 

= Vx i + Ax 1 ^ 2 , 



Vx 

~ G(c, d), 



Ax 

~ G(c,d ), and 



4 

~ IG(a,b). 
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The most significant change to the model is that we now include a smoothing penalty in the 
covariance specification for the registered functions. Here specifying Po (a roughness penalty - 
see Section 2 for more details) in the prior distribution for the registered functions establishes 
regularization in these functions. The associated smoothing parameter is Ax- As mentioned 
previously, Pi is required to define Ex as a proper covariance matrix. More details on these 
matrices can be found in Appendix A. As can be seen above, rjx and Ax are considered as 
additional unknown parameters in this hierarchical model. 

In the prior specifications of this model, equation (20) incorporates penalties for both smooth¬ 
ing and registration within the prior for the registered functions. The full conditional distribution 
for each approximated registered function, Xj(hj), when data are noisily observed is the joint 
full-conditional of the unregistered function and the warping function, 

/(Xi(hj) | rest) = /(wj,Xj | rest). 

= /(wj | Xj, resf)/(Xj | rest) 

Instead of drawing from this joint full-conditional directly, we will proceed by first drawing 
from /(Xj | rest) and then given Xj, draw from /(wj | Xj,resf). 

These full conditional distributions are determined in the standard way recognizing that the 
prior distribution for a registered function can be factored into two components. One component 
penalizes lack of registration given the approximated unregistered function, Xj; the other com¬ 
ponent penalizes roughness in the registered function which implicitly penalizes roughness in the 
unregistered function. The roughness penalty is independent of the warping function and therefore 
also of Wj. Specifically, the prior distribution (20) for each Xj(hj), i — 1,..., N, is such that 


/(Xj(hj) | Xj, W j, zoi, zu, f, rjx, Ax) OC 


exp 

exp 


((Xj(hj) - (z 0i l + zijf)) / 7 i?S- 1 (Xj(h l ) - {z 0i 1 + z u i))) 
((Xj(hj) - (zoil + z 1 jf))'S^ 1 (Xj(hj) - ( ZQi 1 + z u f))) ■ 


* 


( 21 ) 

( 22 ) 


Accordingly, the components of the joint distribution of the data and all unknown parameters 
that are dependent on Wj are (21), and the prior on the approximated base functions, 
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w,: oc 


p 

N p - 1(0,+ A“ i P U) )l{ti + y^{t k - t k ~ 1 ) exp(wi(4_i)) = t p }, 

k =2 

i = 1... N. 


( 23 ) 


The resulting full conditional distribution for the approximated functions w ; is such that 
/ (wj | rest ) oc 


exp 


exp 


-- ((Xi(hi) - (zoi 1 + ^f)) / 7R S- 1 (X i (h i ) - (z 0i l + z u f))) 


O ( w i(7„, 1 S + A U) i P u) ) x Wi) 


1{H + _ tk ~ x ) exp(wj(4_i)) = t p }. 


k =2 


This full conditional does not have a known distributional form and can be sampled from via 
a Metropolis step in a MCMC sampler. 

The components of the joint distribution of the data and all unknown parameters that are de¬ 
pendent on Xj are the sampling distribution (19) and equation (22). The resulting full conditional 
distribution is such that 
/(Xj | rest ) oc 


exp 

exp 


2 (Ty 


(Yj - Xj)'(Yj - Xj 


1 


-- ((Xj(hj) - (zoil + 0ijf))'S^(Xj(hj) - (zoi 1 + Zli f))) 


This full conditional distribution also is not of a known distributional form and can be sampled 
from using a Metropolis step. However, as significant features of the unregistered function, Xj, 
should be unchanged by the registration. Smoothness in the registered function, Xj(hj), implies 
the same level of smoothness in the unregistered function Xj. For ease of sampling, we will re-write 
(22) in terms of the unregistered function, Xj so that 
/(Xj | rest ) oc 


exp 


exp 


2 a\ 


(Yj - Xj)'(Yj - Xj 


\ ((X,: - (*« 1 + *i i f(h,- , )))'Ei 1 (X i 


(z 0 il +Zuf(hj ')))) , 


(24) 


which results in a multivariate normal full conditional distribution for Xj. 
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When noisy observations, Y, ; , i = 1 ,... ,N are recorded, the approximation we make in (24), 
while preserving conjugacy, prevents exact variational Bayes updates to be performed on the 
approximate posterior distributions for the following parameters: X,;, i — 1... N, a\, rjx, and Ax- 
Hence, the adapted variational Bayes procedure proposed here requires special handling under 
this data assumption. 

C.2 Adapted Variational Bayes For Noisy Functional Data 

In the traditional variational Bayes and AVB algorithms, it is assumed that 
A( 0 _ fc )(log f(0 k | rest) can be evaluated for all parameters with conditionally conjugate prior 
distributions. For the registration model that accounts for noisy observations, some of these ex¬ 
pectations cannot be determined exactly. However, we can adjust the original AVB algorithm 
further to perform an approximate inference procedure under these conditions. With these ad¬ 
justments, the convergence properties of the adapted variational Bayes algorithm no longer hold. 
Nevertheless, we have found in practice that the adjusted algorithm still results in useful estimates 
for initializing a MCMC sampler. 

Here we look at why the approximate posterior distributions for X,, i = 1,..., N, r] X , and Ax 
cannot be updated properly using the adapted variational Bayes algorithm. In the rn th iteration, 
the following update should be made to log g(Xj), for i = 1... N: 


log[g (m) (X;)] oc £’( 0 _ x ) [log/(X i I rest)}, 


where 


^-x 4 )P°g/(Xi 1 rest)] oc 

1 

X 

i- 

^Xi|rest 

(T2“I P + 

°Y 

A^X-ilresi 

^X;|rest [ 


\/V'—i 


2LV vv * Purest; 


-n 


cr 


Y 


Taking the expectation over the q distributions for all other parameters except for the base 
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functions results to the following updated parameters of g(Xj) = N p (/j , q ( X .), E ? ( X .)), 


^g(Xi) — (^9(4-)^ + hg(A A ')P2 ) ) anC i 

°Y 

= “I" + hg(A A )P2 )(hg(20i)lp +/iq(^ li )-E'(6/_ x ,)[f(h i )])]. 

CTy 

(25) 

In (25), the expectation of f(h^ x ) is unknown. So, the first approximation we will make is that 

^-x i )[ f ( h r 1 )] ~ /W h * 71 )- 

Similarly, to update log q(rj X )' 

log [q {m \vx)} oc E {e _ vx) [logf(i lx \ rest)}, 

where 

E (e- Vx) [ lo g f(Vx I rest)] oc E {e _ vx) [c Vxlrest log r] X - d Vx \ re stVx], 

Crj x \rest ^ T o, and 

1 N 

d vx \rest = d+ - J^tr[(Xi - {zodp + 2 li f(h 1 “ 1 )))(x i - (z 0i lp + ^(hr 1 )))'?-]. 

2=1 

Taking the expectation over the q distributions for all other parameters except for the base 
functions results to the following updated parameters of q(rj x ) = G(c q ( r , x '),d q ( rix )), 


(m) 

C Q<,Vx) 


N + c, 


d 


(m) 

q(vx) 


1 N 

d + 2 tr [(S ( S « + ^(xoK^) - ^(xpKo^) 1 ? + » q (zu) E (e-r lx ){f(K 1 )]y, 

i= 1 

+ ^^z 0i )^ q (z ll )lpE { e. vx )[f(K 1 )]'+ K%h)+^(, i i ))^-,, A )[ f ( h r 1 ) f ( h r 1 )']), and 

N—l N-1N-1 

”1" + h'q(zoi)) + ££ 

i=l 2—1 j =1 


In the expression for ^, i?(o_ r( v )[f (h ?: x )] and £ , (e_^ Y )[f(h i 1 )f(h i 1 ) / ] are unknown. Thus, we 

will make the following approximations, 

^(e_ w )[f(h, r1 )] « ^(h” 1 ) and ^ (e _„ x) [f (hr- 1 )f (h^ 1 )'] « ^ q ( Xi )/N + n q{f) ( hr 1 )M,(f)( h r 1 )'- 
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Note, S ? ( Xi ) does not depend on i. 

The variational Bayes algorithm update for Ax is similar to that of rjx and requires the same 
approximations, 

log[g (m) (Ax)] oc E {e _ Xx) [logf(\ x \ rest)}, 

where 

E (e-x x) [ lo S /( A * | rest)} oc E {e _ Xx )[c Xx \ r e S t log Ax - d Xx \ r esAx\, 
c\ x \rest = Ny —^— J + c, and 

1 N 

d Vx \rest = d+ - ^tr[(Xj - (z 0i lp + ^iif(hj 1 )))(Xj - (z 0i l p + 2 ; H f(hr 1 )))'P 2 ]. 
i =1 

Taking the expectation over the q distributions for all other parameters except for the base 
functions results to the following updated parameters of q( Ax) = G{c q (\ x ), d q (\ x )), 

4 (t) = iv(^)+c, and 
1 N 

(E( S g (X,) + M g (X i )Mg(X i ) 2/A ? ( X .)(/l g ( Zoi )lp + A t q(z li )-£'(0_ Ax )[f(h i )]) 

2— 1 

+ 2fl q ( Zoi )^q( Zli )l p E(g_ Xx )[f(hi + {<7q( Zli ) + ^(zu)) E (9-\ x ) [ f ( h i ^l) 

N—l N—l N—l 

+ ( 2 + ^l(zoi )) + t L q(.Z0i)t i q{z 0 j)^-{j j- 

2—1 2—1 j =1 

Again, in the expression for d^ x y -E’(6>_ A Y ) [f(h” 1 )] and i?(6/_ Aif )[f(h^ 1 )f(h~ 1 ) / ] are unknown. 
Thus, we will make the following approximations, 

E {o ~\ x )[ f ( 1 )] ~ ^(f)(h, rl ) and )[f(h“ 1 )f(h^ 1 )'] « ^ q{Xl )/N + fi q{f) (hi 1 )^ q{q (h; 1 )'. 

Due to these modifications, if noisy observations are observed the convergence properties of 
the adapted variational Bayes algorithm are not guaranteed to hold, and log[/(Y, w; q)} cannot 
be monitored. However, we can monitor convergence for this model as follows. Taking advantage 
of the fact that functional smoothing converges more quickly than functional registration, fix the 
approximated unregistered functions, X ? , i = 1 after a small number of iterations and 

proceed as if they are known. Then, as in the model where the observations are recorded without 
noise, log[/(X, w; q)} can be monitored. 
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Registered Function 95% Credible Interval - Subject 8 


Registered Function 95% Credible Interval - Subject 11 
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Figure C.l: Examples of Credible Intervals for Noiseless Observations . These are two examples 
from the Boys Growth Velocity Data of the tight credible bands that result from registering 
functions that are pre-smoothed. In Figure C.2, the top and lower right illustrations contain the 
credible intervals for these same observations when the noise process is included in the model. 

C.3 The Berkeley Growth Data 

We refer back to the Berkeley Boys Growth Velocity dataset from Section 4.1. In Section 4.1, 
these data were smoothed prior to registration. Here, we again consider these functions with the 
added assumption that they are corrupted by simulated mean zero iicl Gaussian noise, where the 
true noise variance, cry, is .25. 

While it is common in statistical analysis to perform preprocessing steps before applying a 
particular inference procedure, failing to account for the variability in parameter estimates due 
to the preprocessing step leads to overly narrow confidence (or credible) regions. In some cases, 
the effect may be fairly small, and not much is lost in this oversight. However, as we show here, 
the underestimated variability can be substantial when uncertainty in the preprocessing steps is 
ignored. 

In Section 4.2 is an illustration of how closely AVB and MCMC estimates of the registered 
functions adhere to one another. Not only do these estimates tend to be fairly similar when the 
functions are recorded without noise, but the uncertainty in these estimates is minimal. Figure 
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Unregistered - Subject 8 


Registered - Subject 8 




Unregistered - Subject 11 Registered - Subject 11 




Figure C.2: Examples of Credible Bands for the Unregistered and Registered Functions when 
the Noise Process is Included in the Model. Top and Lower Left 95% credible bands for the 
unregistered functions are plotted with the original noiseless functions for subjects 8 and 11. Top 
and Lower Right For subjects 8 and 11, 95% credible bands for the registered functions are 
plotted with the estimate of the registered ‘true’ functions. 
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C.l contains the credible bands for two of the 39 pre-smoothed Boys Growth Velocity Functions. 
These bands are so narrow the width between them cannot be seen. Keep in mind the posterior 
distributions of the registered functions are certainly multi-modal. These credible bands result 
from imposing the restriction that the mean value of the warping functions at each time point 
over the sample must equal that time point. Even with this restriction, the posterior distributions 
can be multi-modal, ffowever, these narrow credible bands reflect that our estimates are in a 
highly probable area of the posterior distribution with minimal local variance. Figure C.2 contains 
credible bands for both the unregistered and registered functions for the same two functions used in 
Figure C.l after noise has been added to the data and accounted for in the model. The variability 
due to noise is substantial. The solid line in all of the plots contains the noiseless version of these 
estimates (or observations in the case of the unregistered functions). 

In addition to providing more accurate credible intervals, this model estimates the noise vari¬ 
ance to be .258 (actual noise variance is .25). This estimate is obtained using uninformative priors 
for both the noise variance, a'y and the associated smoothing parameter Ax- 

This analysis illustrates how regularizing the data prior to statistical analysis for registration 
models severely limits inference for these models. If significant noise is present in the data it is 
prudent to account for the variability in the registration process due to the noise. Our proposed 
hierarchical model is one way to account for this variability. 
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